Authorship Attribution and Optical Character Recognition Errors

Authors

  • Patrick Juola
  • John Noecker
  • Michael Ryan
Abstract

Stylometric authorship attribution is a fundamental problem. The basic idea behind the research is that one can determine the authorship of a document on the basis of cognitive and linguistic quirks that uniquely identify a person. In many cases, however, noise in the original documents can make this analysis more difficult and less reliable. We investigate the errors introduced by a typical optical character recognition (OCR) process. Using simulated (random) errors in a standard benchmark corpus, we test how sensitive the authorship attribution process is to character misrecognition. Our results indicate that, while accuracy decreases measurably with noise, the decrease is not substantial.
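The abstract mentions simulated random character errors but does not specify the noise model. The following is only a minimal sketch of one plausible way to inject such noise into a text at a chosen rate. The function name `inject_char_noise`, the uniform letter-substitution model, and the `error_rate` and `seed` parameters are illustrative assumptions, not the authors' procedure.

```python
import random
import string
from typing import Optional


def inject_char_noise(text: str, error_rate: float, seed: Optional[int] = None) -> str:
    """Return a copy of `text` with ASCII letters randomly substituted.

    Assumed noise model (hypothetical, not taken from the paper): each ASCII
    letter is independently replaced, with probability `error_rate`, by a
    different letter of the same case chosen uniformly at random.
    Non-letter characters are left unchanged.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    rng = random.Random(seed)  # local generator so results are reproducible
    noisy = []
    for ch in text:
        if ch in string.ascii_letters and rng.random() < error_rate:
            alphabet = string.ascii_lowercase if ch.islower() else string.ascii_uppercase
            # Exclude the original character so a substitution always changes it.
            noisy.append(rng.choice(alphabet.replace(ch, "")))
        else:
            noisy.append(ch)
    return "".join(noisy)


if __name__ == "__main__":
    sample = "Stylometric authorship attribution is a fundamental problem."
    for rate in (0.0, 0.05, 0.2):
        print(rate, inject_char_noise(sample, rate, seed=42))
```

Running an attribution method on copies of a corpus corrupted at increasing values of `error_rate` would give an accuracy-versus-noise curve of the kind the abstract describes. Real OCR errors are not uniform (visually similar characters are confused more often), so a confusion-matrix model would be a closer approximation.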

Similar resources

Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm

This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies in computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Characte...

On musical stylometry—a pattern recognition approach

In this short communication we describe some experiments in which methods of statistical pattern recognition are applied for musical style recognition and disputed musical authorship attribution. Values of a set of 20 features (also called ‘‘style markers’’) are measured in the scores of a set of compositions, mainly describing the different sonorities in the compositions. For a first study ove...

On the Robustness of Authorship Attribution Based on Character N-gram Features

A number of independent authorship attribution studies have demonstrated the effectiveness of character n-gram features for representing the stylistic properties of text. However, the vast majority of these studies examined the simple case where the training and test corpora are similar in terms of genre, topic, and distribution of the texts. Hence, there are doubts whether such a simple and lo...

The use of sampling techniques in the retention of records: A RAMP study with guidelines

Optical Character Recognition (OCR) document. WARNING! Spelling errors might subsist. In order to access to the original document in image form, click on "Original" button on 1st page...

Authorship Attribution using Compression Distances

Authorship attribution has been a field of interest for researchers in the past, especially for forensic purposes. In this thesis, to obtain the degree of Bachelor of Science from the Leiden University, we investigate character n-grams and so-called compression distances to prototypes on several datasets, i.e., the datasets provided by PAN Labs (a benchmarking activity on uncovering plagiarism,...

Journal title:
  • TAL

Volume 53

Publication date: 2012